Machine learning of factors for improving oyster hatchery production

Authors

Vyacheslav Lyubchich

Matthew W. Gray

Greg M. Silsbe

Published

2023-09-07

Our research focuses on investigating the variability in oyster hatchery production yields and identifying factors that influence these outcomes. By understanding the predictors of hatchery production outcomes, we aim to enhance the efficiency of oyster hatchery operations.

Improving the efficiency of hatchery production is crucial as it not only impacts profitability but also addresses concerns among growers regarding the availability of oyster seeds. To achieve this, we gathered production data from the Horn Point Laboratory Oyster Hatchery and collected relevant information on weather and water conditions near the hatchery.

Leveraging the power of machine learning, we employed predictive modeling techniques to identify key indicators of low and high aquaculture yields. Yield was defined as the percent of pediveliger larvae produced from the fertilized eggs initially stocked at the beginning of the production process. The models we developed can generate forecasts for specific conditions and provide valuable insights to hatcheries on implementing appropriate practices to enhance cost-effectiveness.

We are pleased to make our findings publicly available, encouraging comments and feedback from the research and oyster growers community. Should you have any thoughts or suggestions, please don’t hesitate to share them as a comment below or contact the authors.

1 Background

The success of oyster aquaculture, fisheries augmentation, and restoration in the Chesapeake Bay heavily relies on the hatchery production of larvae. Despite the expertise of skilled hatchery staff, shellfish hatcheries frequently encounter periods of suboptimal larval growth and inconsistent production levels. The Horn Point Laboratory Oyster Hatchery (HPLOH) is no exception, with highly variable yield rates, ranging from 5% to 17% on average, and occasional broods exceeding 50% yield (Gray et al. 2022).

In addition to reduced production levels, HPLOH and other hatcheries commonly face acute mortality events known as crashes. Crashes have been an ongoing issue within the global shellfish aquaculture industry (Walker 2017) and affect various cultivated shellfish species (Jones 2006). These crashes can occur suddenly, leading to the loss of entire batches of larvae overnight. Observations have revealed larvae covered in bacterial coatings or filled with ciliates, making it challenging to determine whether these factors caused the crash or opportunistic invaders took advantage of a compromised batch (Estes et al. 2004). Besides bacterial contamination, acute fluctuations in water quality, such as harmful algal blooms or abrupt changes in salinity or oxygen levels, can also disrupt culture conditions and critical components of the production process, including broodstock conditioning or algae cultures (Jones 2006).

To ensure suitable production conditions, hatcheries regularly monitor the quality of their source water. When conditions are suboptimal, facilities can make adjustments, such as adding salt, to mitigate adverse environmental effects and resume production. However, in many cases, the causes of crashes or production inefficiencies remain unidentified. Hatchery staff often lack the time and resources to investigate the root causes thoroughly. As a result, the most common approach to managing crashes is to flush out “bad” water, clean tanks, and hope that refilling them with “good” water will enable production to resume. Unfortunately, adverse conditions can persist for extended periods (Barton et al. 2012; Gray et al. 2022), severely limiting oyster production regionally and incurring significant financial losses for the industry (e.g., over $110 million during the “Seed Crisis” in the Pacific Northwest, Ekstrom et al. 2015). Therefore, there is a critical need to identify predictors for hatchery yield, particularly regarding low yield, to prevent costly crashes. Tools that enhance production reliability would increase the profitability of public and private hatcheries by improving resource utilization and minimizing waste. Moreover, ensuring a more reliable seed supply would alleviate concerns among growers in Maryland and beyond, creating greater opportunities for industry expansion.

2 Data

2.1 Oyster production data

We supplemented the long-term hatchery production data used by Gray et al. (2022) with more recent data from the HPLOH. The combined dataset spanned the years 2011 to 2021 and provided detailed information on broodstock, spawning, and larval production in the hatchery.

The primary focus of our study was the hatchery yield, which was defined as the percentage of pediveliger oyster larvae (i.e. eyed larvae) out of all the fertilized eggs added into a tank: \[ \text{Yield} = \frac{\text{Eyed larvae produced}}{\text{Eggs added to tank}} \times 100\%. \] To gain a clearer understanding of the factors influencing production outcomes, we specifically examined broods that were not mixed during the production process. Consequently, we did not include information on mixed broods in our analysis.
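As a minimal numeric illustration of this yield definition, consider the following Python sketch (the function name and numbers are hypothetical, not from the hatchery data):

```python
def yield_percent(eyed_larvae: float, eggs_added: float) -> float:
    """Hatchery yield: eyed (pediveliger) larvae as a percentage of eggs stocked."""
    return eyed_larvae / eggs_added * 100

# Example: 125,000 eyed larvae produced from 1 million fertilized eggs
print(yield_percent(125_000, 1_000_000))  # 12.5
```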

During the data cleaning process, several measures were taken:

  • All yields were set to zero when the HPLOH indicated that the brood was “dumped” due to a production crash.
  • Records with HPLOH notes indicating “no eggs” (7 records) or those lacking information on the number of eyed larvae (9 records) were removed.
  • Records with an unusually fast production rate (fewer than 9 days to the first eyed oysters) were removed. We identified three such records; production times this short typically indicate rushed production for a specific buyer of larvae.
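The cleaning rules above can be expressed as a short filtering step. The following Python/pandas sketch uses hypothetical column names and toy records (the study's actual analysis was done in R; the code is on GitHub):

```python
import pandas as pd

# Toy records with hypothetical column names mirroring the cleaning rules
df = pd.DataFrame({
    "yield_pct":    [12.0, 8.5, None, 15.0, 3.0],
    "notes":        ["", "dumped", "no eggs", "", ""],
    "eyed_larvae":  [1.2e6, 0.0, None, 2.0e6, 0.5e6],
    "days_to_eyed": [14, 16, 12, 8, 15],
})

# 1. Set yield to zero for broods noted as "dumped" (production crash)
df.loc[df["notes"].str.contains("dumped"), "yield_pct"] = 0.0

# 2. Drop records noted "no eggs" or missing the eyed-larvae count
df = df[~df["notes"].str.contains("no eggs")]
df = df.dropna(subset=["eyed_larvae"])

# 3. Drop unusually fast broods (fewer than 9 days to first eyed larvae)
df = df[df["days_to_eyed"] >= 9]

print(len(df))  # 3 records remain
```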

From the HPLOH records covering different stages of the production process, we considered the following predictors of yield:

  1. Broodstock collection:
  • Name of the oyster bar of collected broodstock.
  2. Conditioning:
  • Average conditioning pH (NBS units),
  • Average conditioning salinity (ppt), and
  • Average conditioning temperature (°C).
  3. Spawning:
  • Average shell height of females (mm),
  • Average shell height of males (mm),
  • Fecundity (millions of eggs per female),
  • Gonadal index (1 to 4), and
  • Stimulation used (female / male / other / none).
  4. Larval culture:
  • Week of year when production started (1 to 52) and
  • Carbonate tank buffering (yes or no).

We narrowed down the list of predictors compared to Gray et al. (2022) to focus on identifying explanatory relationships that could be used to predict yields before the production process commenced. For instance, we did not include the year of broodstock collection as a predictor due to its limited explanatory value. Also, the percentage survival to the prodissoconch II stage (a.k.a. D-hinge) was not used, since this information becomes available only later in the production process.

To determine the date when each new brood’s tank was filled with ambient water, we used the date of the first drain and the corresponding oyster age as reported by the HPLOH: \[ \text{Date-time of first fill} = \text{Date-time of first drain} - \text{Age at the first drain (hours)}. \] This calculated date served as a reference point to extract information on environmental conditions near the hatchery.
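The back-calculation above is a simple date-time subtraction; a Python sketch with hypothetical values:

```python
from datetime import datetime, timedelta

def first_fill(first_drain: datetime, age_at_drain_hours: float) -> datetime:
    """Back-calculate when the tank was filled, as in the formula above."""
    return first_drain - timedelta(hours=age_at_drain_hours)

# Example: first drain on June 10 at 08:00, larvae 48 hours old at that point
print(first_fill(datetime(2021, 6, 10, 8, 0), 48))  # 2021-06-08 08:00:00
```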

2.2 Environmental data

We drew on several data sources to obtain environmental variables recorded on the day of tank filling. To account for potential delayed effects on oyster hatchery production outcomes, these variables were used both in their original form and with lags of 1 to 7 days, except for the variables representing winter conditions.
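Creating the lagged versions of a daily variable can be sketched as follows (Python/pandas with hypothetical salinity values; the study used R):

```python
import pandas as pd

# Daily environmental series (hypothetical values)
env = pd.DataFrame(
    {"salinity": [10.2, 10.5, 10.1, 9.8, 10.0, 10.4, 10.6, 10.3]},
    index=pd.date_range("2021-06-01", periods=8, freq="D"),
)

# Add lagged versions (1 to 7 days) so each tank-fill date can be matched
# with the conditions on the preceding days
for lag in range(1, 8):
    env[f"salinity_lag{lag}"] = env["salinity"].shift(lag)

print(env.loc["2021-06-08", "salinity_lag7"])  # value from 2021-06-01
```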

Goose’s Reef buoy

We used quality-controlled records from the Goose’s Reef buoy, which is part of the NOAA’s Chesapeake Bay Interpretive Buoy System (CBIBS). The water quality variables included:

  • Chlorophyll concentration (ug/L),
  • Dissolved oxygen concentration (mg/L),
  • Salinity (PSU),
  • Turbidity (NTU), and
  • Water temperature (°C).

In addition to water quality, the buoy also recorded atmospheric conditions, including:

  • Air temperature (°C),
  • Wind direction (degrees clockwise from true north), and
  • Wind speed (m/s).

Examining water quality effects on hatchery production is an intuitive avenue of investigation, but in an effort to build predictive tools that may eventually lead to an early warning system for the hatchery, it was important for us to also examine weather data and its effects on Choptank River water quality and hatchery production.

During the data cleaning process, the following steps were taken to ensure data quality and completeness:

  • Retained only data with the highest quality codes (1 and 2).
  • Removed records of chlorophyll, dissolved oxygen, salinity, and turbidity outside the range of 0–50.
  • Aggregated the 10-minute data to obtain daily averages.
  • Used simple linear regressions between the buoy data and corresponding data measured at the NOAA’s station CAMM2 in Cambridge, MD, to fill in missing information on daily air and water temperatures where possible (adjusted coefficient of determination \(R^2_{adj.} > 0.95\) in both cases).
  • Used a simple linear regression between the buoy air temperature and air temperature measured at the Cambridge airport to fill in missing daily air temperatures where possible. (Linear regressions of Goose’s Reef wind speed and direction with corresponding observations measured at the station CAMM2 and the airport produced \(R^2_{adj.} < 0.7\) and consequently were not used to fill in the missing data.)
  • Filled in the remaining missing data for the Goose’s Reef using the iterative random forests algorithm implemented in the R package missForest (Stekhoven 2022). This algorithm fitted up to 100 random forests, each with 50 regression trees, and incorporated all available variables measured at the buoy, station CAMM2 (air and water temperatures, wind speed and direction), and the Cambridge airport (air temperature, wind speed and direction), along with the numeric year, month, and day of the year for time-series predictions.
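The regression-based gap filling used for the temperatures can be sketched as follows (Python with synthetic data; the coefficients and series are illustrative only, not the actual buoy or station records):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic daily air temperatures: a buoy series with gaps and a nearby
# station series that is nearly collinear with it
station = rng.uniform(0, 30, size=200)
buoy = 0.9 * station + 1.5 + rng.normal(0, 0.5, size=200)
buoy_obs = buoy.copy()
buoy_obs[::10] = np.nan  # introduce missing values

# Fit a simple linear regression on the jointly observed days
mask = ~np.isnan(buoy_obs)
slope, intercept = np.polyfit(station[mask], buoy_obs[mask], 1)

# Fill buoy gaps with regression predictions from the station series
filled = np.where(np.isnan(buoy_obs), slope * station + intercept, buoy_obs)
assert not np.isnan(filled).any()
```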

By following these steps, we completed the dataset of environmental conditions directly measured near the HPLOH, incorporating both water quality and atmospheric information from the Goose’s Reef buoy.

Daymet gridded weather conditions

We used the R package daymetr (Hufkens 2023) to access the Daymet Version 4 dataset. Daymet offers gridded estimates of daily weather based on statistical modeling techniques that interpolate and extrapolate ground-based observations. For the 1 km \(\times\) 1 km grid corresponding to the location of the HPLOH, we obtained the following daily time series of weather variables:

  • Precipitation (mm),
  • Maximum temperature (°C), and
  • Minimum temperature (°C).

Additionally, we calculated the daily average temperature (°C) by taking the average of the minimum and maximum temperatures.

Remote sensing products

We used Google Earth Engine pipelines to collect daily time series of the following variables:

  • Normalized difference vegetation index (NDVI). We obtained NDVI measurements from both the Landsat-8 Operational Land Imager (OLI) and the Sentinel-2 MultiSpectral Instrument (MSI). The NDVI serves as an indicator of vegetation health and was used to estimate the potential impact of agricultural practices and other land uses on water quality. These measurements were aggregated across the larger Choptank River watershed and nearby subwatersheds. Specifically, we examined the changes (differences) in NDVI over various time intervals: 1, 2, 4, 8, 16, and 32 days. This resulted in seven NDVI variables that helped us understand the effects of agricultural practices, such as fertilizer and herbicide application.
  • Chlorophyll. We also considered chlorophyll measurements within a 1-km radius of the HPLOH (inlet) and in the Choptank River subwatershed used for calculating the NDVI. These two remotely sensed chlorophyll variables provided insights into the levels of chlorophyll-a, which is an indicator of phytoplankton abundance and water quality.

Winter averages

To consider the influence of winter conditions preceding the hatchery production seasons, we averaged the following daily variables for January–February of each year:

  • Air temperature (Goose’s Reef),
  • Chlorophyll concentration (Goose’s Reef),
  • Oxygen concentration (Goose’s Reef),
  • Salinity (Goose’s Reef),
  • Water temperature (Goose’s Reef),
  • Precipitation (Daymet), and
  • NDVI (remote sensing).

Overall, we considered 144 predictors, including the original 32 predictors and their lagged versions. Table 1 shows the statistical summaries, and Figure 1 shows pairwise plots of the original (not lagged) variables.

Table 1: Summary of the response variable (yield) and original numeric predictors.
Variable unit n mean sd min max range se
Yield % 615 10.097 9.474 0.000 56.359 56.359 0.382
AvgShellHeightFemales mm 612 114.837 10.435 82.457 187.600 105.143 0.422
AvgShellHeightMales mm 608 112.614 15.712 57.556 296.319 238.764 0.637
Fecundity millions of eggs/female 614 13.685 7.371 0.000 48.100 48.100 0.297
GonadalIndex 1 to 4 556 2.552 0.259 1.500 3.250 1.750 0.011
Spawn_pH NBS units 459 7.796 0.242 6.701 8.305 1.604 0.011
Spawn_Salinity ppt 539 10.115 1.614 5.875 14.187 8.312 0.070
Spawn_Temp °C 539 19.742 1.918 15.674 27.640 11.966 0.083
Week 1 to 52 615 24.418 7.254 9.000 39.000 30.000 0.292
GR_AirTemp °C 615 20.645 6.709 -0.568 30.541 31.109 0.271
GR_Chlorophyll ug/L 615 8.866 5.968 0.619 47.057 46.438 0.241
GR_Oxygen mg/L 615 9.505 1.905 4.772 14.635 9.863 0.077
GR_Salinity ppt 615 10.685 2.433 2.596 17.059 14.463 0.098
GR_Turbidity NTU 615 4.667 5.425 0.315 41.114 40.799 0.219
GR_WaterTemp °C 615 21.326 7.058 0.369 30.342 29.973 0.285
GR_WindDirection degrees (1 to 360) 615 182.762 55.159 48.720 322.340 273.620 2.224
GR_WindSpeed m/s 615 4.688 1.828 0.000 11.514 11.514 0.074
DM_Precip mm/day 615 4.641 10.383 0.000 99.230 99.230 0.419
DM_TempAvg °C 615 21.231 6.560 -1.305 31.050 32.355 0.265
DM_TempMax °C 615 26.276 6.535 1.620 36.510 34.890 0.264
DM_TempMin °C 615 16.187 6.862 -4.260 26.180 30.440 0.277
CHLct ug/L 615 33.865 14.721 6.790 83.061 76.271 0.594
CHLin ug/L 615 33.865 14.721 6.790 83.061 76.271 0.594
NDVI -1 to 1 615 0.476 0.222 -0.023 0.929 0.953 0.009
NDVI_d1 -1 to 1 615 0.001 0.077 -0.907 0.755 1.662 0.003
NDVI_d2 -1 to 1 615 0.006 0.122 -0.895 0.846 1.742 0.005
NDVI_d4 -1 to 1 615 0.006 0.157 -0.871 0.706 1.577 0.006
NDVI_d8 -1 to 1 615 0.023 0.187 -0.824 0.884 1.708 0.008
NDVI_d16 -1 to 1 615 0.042 0.209 -0.624 0.945 1.569 0.008
NDVI_d32 -1 to 1 615 0.079 0.225 -0.879 0.726 1.605 0.009
Winter_NDVI -1 to 1 615 0.270 0.082 0.168 0.458 0.291 0.003
Winter_Precip mm/day 615 3.075 0.697 1.676 4.002 2.326 0.028
Winter_WaterTemp °C 615 3.806 1.458 1.696 6.109 4.413 0.059
Winter_Salinity ppt 615 13.657 1.496 9.669 15.619 5.950 0.060
Winter_Oxygen mg/L 615 12.817 0.391 12.343 13.602 1.259 0.016
Winter_Chlorophyll ug/L 615 7.831 2.724 2.129 11.102 8.973 0.110
Winter_AirTemp °C 615 2.872 1.954 -0.404 5.983 6.387 0.079

Figure 1: Pairwise scatterplots of each of the variables (horizontal axis) with yield (vertical axis).

Additionally, Figure 2 shows how some of the environmental variables relate to those measured at the hatchery. Ambient water and air temperatures are correlated with the water temperature at the hatchery (Figure 2 A–C), as are the water salinities (Figure 2 D), but no visible relationship exists between precipitation and hatchery salinity (Figure 2 E).

Figure 2: Correspondence of some of the environmental variables from the Goose’s Reef (GR) and Daymet (DM) datasets with conditions at the hatchery.

3 Machine learning methods

In our study, we employed data-driven machine learning methods to investigate the impact of various variables on hatchery production yield and to identify key predictors. The main method, which produced a predictive model, was the random forest. To select relevant variables and appropriate structure of the random forest model, we tried several variable selection techniques and combinations of random forest parameters. The technique for picking important variables and the model parameters were selected by assessing the resulting model accuracy on data that were deliberately left out (i.e., we cross-validated the selections we made). Finally, we used Shapley values to interpret the effects of individual variables represented in the model and decompose individual predictions of hatchery yields. The subsections below describe each of the methods in more detail and name the software packages we used. Overall, we used R 4.2.2 (R Core Team 2022) for the computations, and our full code can be accessed on GitHub.

3.1 Random forest

Random forest (Breiman 2001) is a versatile approach that builds predictive models for hatchery yields, given a training dataset with observed yields (response variable) and potentially related variables (predictors). The random forest algorithm picks a predictor and a value of that predictor that split the response values into the most homogeneous subsets (minimizing variance around the group means). Consecutive splits can use a different predictor or simply a different threshold of the same predictor. The resulting sequence of splits forms a tree, where the root is the whole input dataset and the leaves are the final subsets of the data that are not split further (the minimal size of these subsets is usually called the nodesize). Average response values in the leaves are used as the tree predictions. For example, to obtain a yield prediction from a tree for a given combination of predictors, the algorithm finds the leaf corresponding to this combination and uses the average yield in that leaf as the prediction.

To grow multiple trees from the same input dataset and to make the trees more different from one another, each tree is grown on a bootstrap sample (sample with replacement from the original data), and only a random subset of size mtry of all the predictors is considered when selecting each new split in a tree. Predictions from all trees in the random forest are averaged to obtain a single prediction for a given combination of input variables. While using simple randomization and multiple simple splits, random forest is capable of modeling complex relationships between variables and providing accurate predictions.
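The procedure described above can be sketched in a few lines using Python's scikit-learn on synthetic data (the study used the R package ranger; in scikit-learn, max_features plays the role of mtry and min_samples_leaf the role of nodesize):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))                           # 5 hypothetical predictors
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, 300)   # synthetic "yield"

# Each tree is grown on a bootstrap sample; max_features ~ mtry,
# min_samples_leaf ~ nodesize
rf = RandomForestRegressor(
    n_estimators=500, max_features=2, min_samples_leaf=5, random_state=0
).fit(X, y)

# A prediction is the average over all 500 trees
print(rf.predict(X[:1]))
```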

An advantage of the random forest algorithm is the low number of parameters that need to be tuned to improve the model structure and predictive performance. Most often, mtry and nodesize are tuned, while default values can be used for other parameters of the algorithm, for example,

  • the size of the bootstrap samples (we used the default, equal to the sample size),
  • case weights (we weighted all observations equally), and
  • number of trees (we used the default of 500 since by reaching this number of trees the model errors plateaued and did not decrease further).

The nodesize controls the depth of the trees: smaller nodesize results in deeper and possibly overfitted trees (overfitted models provide a close fit to the data but often have poor performance on a new testing dataset), while larger nodesize results in shallower trees with better out-of-sample performance. However, random forest aggregates predictions from many overfitted trees, hence the effect of nodesize is often modest, and full trees can be grown. By default, nodesize = 5 when modeling a numeric response variable.

Low mtry corresponds to more different (decorrelated) trees, which results in lower variance but a higher bias of the random forest. In turn, high mtry increases the chance of the most relevant variable to be picked when growing the trees, which reduces the bias but increases the variance of the ensemble (Hastie et al. 2009). The default value of mtry is (rounded down) \(p/3\) or \(\sqrt{p}\) in different software packages, where \(p\) is the total number of predictors.

The mtry, nodesize, and other parameters mentioned here are sometimes called the ‘tuning parameters’ or ‘hyperparameters’ because they are set by the user and are not estimated from the input data directly (for example, compared to coefficients in statistical regression models). In this study, we used the cross-validation technique described below to consider several combinations of these parameters and make a data-informed decision. Specifically, we considered mtry of 5, 7, 10, 15, 20, 25, 30, 40, and 50; and nodesize of 1, 3, 5, and 10. We used the R package ranger (Wright et al. 2023) to train these random forest models.
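Tuning these hyperparameters by cross-validation can be sketched as a grid search (Python/scikit-learn on synthetic data; the actual grids above were evaluated in R with ranger):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 10))
y = 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 1, 200)

# max_features ~ mtry, min_samples_leaf ~ nodesize
grid = {"max_features": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    grid, cv=10, scoring="r2",
).fit(X, y)

print(search.best_params_)
```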

3.2 Identification of important predictors (a.k.a. variable selection)

We applied variable selection techniques to improve the performance of the random forest model and, as a result, identify important predictors of hatchery yield. When many of the \(p\) original predictors are irrelevant, there is a smaller chance that at least one relevant predictor appears in the random subset of mtry predictors considered for a new tree split (Hastie et al. 2009). Reducing the number of irrelevant predictors ensures that most subsets of size mtry will contain at least one relevant predictor for growing the tree. However, it is not necessary to remove absolutely all irrelevant predictors, since they may never be picked for a split as long as a better alternative is present in the mtry set (Hastie et al. 2009).

The final random forest model might not employ all of the input variables, and the modeled relationships are not expressed in a single coefficient (as is possible in statistical linear regression models); this reflects the black-box nature of the method. We address it using the model interpretation techniques discussed below.

The importance of a predictor in a random forest is usually quantified as the average increase of prediction errors after the values of this predictor are permuted. If the predictor is important, then random permutations of its values will worsen the prediction accuracy more than if the predictor were unimportant. However, there is no inherent threshold to separate the importance values into statistically significant or not, hence additional methods have been developed.

From a variety of methods for variable selection, we applied two best-performing algorithms developed specifically for random forest models: Boruta (Kursa and Rudnicki 2010) and interaction forest (Hornung and Boulesteix 2022). We also applied each of these algorithms recursively (until no irrelevant predictors were identified) and compared to the baseline when all predictors were used, without selection. This gave us five options for variable selection that we compared in a cross-validation study.

To decide on the significance of random forest predictors, the Boruta algorithm creates permuted copies of the original predictors. Then, the algorithm calculates the importances of these ‘shadow’ predictors and groups them into maximal, mean, and minimal based on the obtained values. Since the shadow predictors are inherently irrelevant, as they are just random permutations, their importances represent the importance distribution of irrelevant predictors. The algorithm compares the maximal shadow importances with the importances of real predictors, using the pairwise t-test adjusted for the number of comparisons. In our single implementation of Boruta, we kept only those predictors that had significantly higher importances than the maximal shadow ones (i.e., removed predictors with importances below or not significantly different from the shadow ones). In our recursive implementation of Boruta, we removed only rejected predictors (i.e., those with importances significantly lower than the shadow ones) and repeated the selection until no more predictors could be rejected. We used the R package Boruta (Kursa and Rudnicki 2022) with its default settings for applying this algorithm.
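The shadow-predictor idea behind Boruta can be illustrated directly (Python sketch on synthetic data; the real algorithm uses permutation importance and formal statistical tests, and we used the R package Boruta, so this is only a simplified analogue):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 4))
y = 8 * X[:, 0] + rng.normal(0, 1, 300)  # only predictor 0 is relevant

# Append "shadow" predictors: column-wise permuted copies, irrelevant by design
shadows = rng.permuted(X, axis=0)
X_aug = np.hstack([X, shadows])

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_

# A real predictor is a candidate keeper if its importance beats the best shadow
max_shadow = imp[4:].max()
keep = [j for j in range(4) if imp[j] > max_shadow]
print(keep)  # predictor 0 should be among the keepers
```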

The algorithm of interaction forest presents a new effect importance measure (EIM) that allows us to rank the importances of individual predictors (similar to the more traditional approaches described above) and to consider bivariate splits (interaction effects) and their importance. In our simple implementation of interaction forest, all original predictors were used, without variable selection, hence only the selection of splits differed from the conventional random forest. In our recursive implementation of the interaction forest, we removed predictors with univariate EIM < 0 and retrained the interaction forest. We also considered the option of keeping predictors that resulted in bivariate EIM > 0 (positive importance of interaction effects), but it did not result in reducing the number of predictors from the original number. We used the R package diversityForest (Hornung and Wright 2023) to train the interaction forests.

3.3 Cross-validation for performance evaluation

To compare different options of random forest algorithms and variable selection and select one for the final model, we used cross-validation on the available data. In cross-validation, only a portion of the data is used for training the model (a.k.a. training set) while a hold-out portion (a.k.a. validation set) is treated as new data that can be used to assess real-life predictive performance of the model.

Specifically, we used \(k\)-fold cross-validation, where the input cases are randomized and separated into \(k\) equal folds (subsets). We can use the folds \(2, \ldots, k\) to train the model, and predictors from fold 1 to make yield forecasts for fold 1 (the validation fold). The procedure is repeated until all \(k\) folds are used as validation folds. We used \(k = 10\) and the R package caret (Kuhn 2023) to separate the data into folds so the same randomization is used to compare competing options.

The forecasts on the validation folds are considered out-of-sample forecasts since the corresponding data were not used for model specification and training. Let \(e_i\) denote the cross-validation error, which is the difference between true yield (\(y_i\)) and the forecast (\(\hat{y}_i\)): \[ e_i = y_i - \hat{y}_i, \] where \(i = 1, 2, \ldots, n\) and \(n\) is the sample size. We evaluated the performance of alternative models using three metrics: mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (\(R^2\)), which were calculated as follows: \[ MAE = n^{-1} \sum |e_i|, \]

\[ RMSE = (n^{-1} \sum e_i^2)^{1/2}, \]

\[ R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}, \] where \(\bar{y} = n^{-1} \sum y_i\) is the average observed yield. Smaller errors (MAE and RMSE) are preferred, but RMSE penalizes large individual errors \(e_i\) more than MAE. Higher \(R^2\) is preferred as it shows the percentage of yield variability captured with the forecasts. The performance metrics MAE, RMSE, and \(R^2\) generally provided consistent results, but \(R^2\) was used for making final selections.
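The three metrics can be computed directly from the cross-validation errors; a Python sketch with made-up observed and forecast yields:

```python
import numpy as np

def cv_metrics(y, y_hat):
    """MAE, RMSE, and R^2 as defined above."""
    y = np.asarray(y)
    e = y - np.asarray(y_hat)
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    r2 = 1 - np.sum(e ** 2) / np.sum((y - np.mean(y)) ** 2)
    return mae, rmse, r2

# Example: observed vs. forecast yields (%)
y = [10.0, 5.0, 20.0, 0.0]
y_hat = [12.0, 4.0, 18.0, 2.0]
mae, rmse, r2 = cv_metrics(y, y_hat)
print(round(mae, 2), round(rmse, 2))  # 1.75 1.8
```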

To identify statistically significant differences between different algorithms, we also used paired \(t\)-tests to compare average performance metrics from \(k\) cross-validation folds. The \(p\)-values from these tests were adjusted for the number of pairwise comparisons using the Benjamini–Hochberg procedure to control the false discovery rate (Benjamini and Hochberg 1995). The tests with the adjustment for multiple testing were implemented using the R packages stats (R Core Team 2022) and rstatix (Kassambara 2023a), and visualized with boxplots using the R package ggpubr (Kassambara 2023b).

In this study, we first ran a cross-validation to compare the five algorithms of variable and split selection (see Section 3.2). For this first round of cross-validation, we used the set of records for which all the predictors were available (see Table 1 for the number of available values for each variable). Given the selected variables, we could redefine the dataset if some of the predictors with missing values were removed. Hence, for tuning the nodesize and mtry (see Section 3.1) in the second round of cross-validation, we used a dataset with complete observations that was at least as large as the one used in the first round.

3.4 Model interpretation

Model interpretation techniques enhance our understanding of black-box models by providing insights into the relationships learned by the model and into which of the input variables affected the predictions most. Partial dependence plots are a typical way to demonstrate pairwise predictor–response relationships represented in a random forest (Hastie et al. 2009); however, these plots do not provide interpretations of individual predictions. In this study, we use another method, based on the Shapley decomposition of predictions into contributions of individual predictors (Molnar 2022), adapted to compute fast (approximate) Shapley values that explain individual predictions in tree-based models like the random forest (Štrumbelj and Kononenko 2014).

The SHAP (SHapley Additive exPlanations) values \(\phi_j\) by Lundberg et al. (2020) have the additivity property such that the forecast of yield for a given vector of inputs \(x\) can be decomposed as a sum (Molnar 2022): \[ \hat{y}(x) = \phi_0 + \sum \phi_j, \] where \(\hat{y}(x)\) is the forecast from the random forest model, \(\phi_0 = \mathrm{E}(\hat{y}(X))\) is the baseline set to the average prediction across all cases, and \(j\) is from 1 to the number of predictors \(p\).
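The additivity property can be verified numerically. The following Python sketch computes exact Shapley values for a toy prediction function by brute force over predictor orderings (our study used fast approximate SHAP for the random forest; the function here is a hypothetical stand-in):

```python
import math
from itertools import permutations
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(size=(100, 3))  # background data, 3 hypothetical predictors

def model(Z):
    """Toy stand-in for the random forest prediction function."""
    return 4 * Z[:, 0] * Z[:, 1] + 2 * Z[:, 2]

def shapley_values(x, X_bg, f):
    """Exact Shapley values: average marginal contribution of each predictor
    over all orderings, with absent predictors marginalized over X_bg."""
    p = len(x)
    phi = np.zeros(p)
    n_orders = math.factorial(p)
    for order in permutations(range(p)):
        Z = X_bg.copy()
        prev = f(Z).mean()
        for j in order:
            Z[:, j] = x[j]              # predictor j joins the coalition
            cur = f(Z).mean()
            phi[j] += (cur - prev) / n_orders
            prev = cur
    return phi

x = X[0]
phi = shapley_values(x, X, model)
phi0 = model(X).mean()                   # baseline: average prediction
# Additivity: baseline plus contributions recovers the individual prediction
assert np.isclose(phi0 + phi.sum(), model(x[None, :])[0])
```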

First, we use average absolute values of SHAP, denoted as mean(|SHAP value|) in the figures below, to rank the importance of predictors, analogous to using the permutation-based importance in random forest. The SHAP importance shows how much each variable affects the predictions in the dataset, on average.

Second, we provide SHAP-based partial dependence plots. Each of the plots shows SHAP values for a given predictor, hence demonstrating the general pattern of the relationships (similar to the partial dependence plots) and variability.

Third, we summarize the explained effects for subsets of the cases. We are particularly interested in the well-predicted cases (i.e., those where the model accuracy was high, with an absolute error below 3 percentage points). To analyze the environmental and hatchery conditions responsible for crashes or high yields, we select well-predicted crashes or well-predicted high yields, respectively.
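Selecting the well-predicted cases amounts to a simple filter on the absolute errors; a Python sketch with hypothetical yields:

```python
import numpy as np

# Observed and predicted yields (%), hypothetical values
y = np.array([0.0, 2.0, 15.0, 40.0, 10.0])
y_hat = np.array([1.5, 9.0, 14.0, 25.0, 11.0])

well_predicted = np.abs(y - y_hat) < 3   # absolute error below 3 points
crashes = well_predicted & (y == 0)      # well-predicted crashes
print(np.flatnonzero(well_predicted))    # indices of well-predicted cases
```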

To compute the Shapley additive explanations, we used the R package fastshap (Greenwell 2023) and the package shapviz (Mayer 2023) to visualize the results.

4 Results

4.1 Variable selection

From Figure 3, variable selection using the algorithms Boruta, recursive Boruta, and recursive interaction forest led to improved performance (lower errors and higher \(R^2\)) compared with not using a selection algorithm (i.e., using all the variables) in a random forest. For the \(p\)-values of the tests comparing the performances of different algorithms, see Section 6.1. The recursive implementation of Boruta consistently resulted in the most improved performance compared with using all the variables; however, the performance of recursive Boruta was not significantly different from that of Boruta or recursive interaction forest (Figure 3). Recursive Boruta generally selected fewer variables than the recursive interaction forest but more than Boruta (Figure 4). The interaction forest (non-recursive) resulted in using all variables (Figure 4), and hence its results were not different from using all variables without selection (Figure 3). Overall, we proceeded with the recursive implementation of the Boruta algorithm for selecting relevant predictors in the model.

Figure 3: Performance metrics from 10-fold cross-validation evaluating the efficiency of variable selection algorithms (Boruta, interaction forest, and their recursive options) compared to using all available predictors. A) Mean absolute error (MAE). B) Root mean square error (RMSE), and C) coefficient of determination (\(R^2\)) from the 10 validation subsets. The stars denote pairs with significant differences at the levels 0.05\(^{*}\), 0.01\(^{**}\), 0.001\(^{***}\), or 0.0001\(^{****}\) after adjusting the \(p\)-values for the number of paired tests.

Figure 4: Box-plots for the number of variables selected by each algorithm in 10-fold cross-validation.

The recursive Boruta algorithm, when applied to the whole dataset, selected 56 predictors, which we use in the subsequent analysis (Figure 5).

Figure 5: Variable selection with recursive Boruta (results from the last iteration of the algorithm). Green boxes correspond to predictors confirmed as important; yellow boxes correspond to predictors without a decision (treated as important), and blue boxes combine importance information (min, mean, and max) from shadow predictors obtained by permuting the original predictors.

4.2 Summary of the model

From Figure 3, random forest models using the recursive Boruta algorithm for variable selection achieve an MAE of 6.0 percentage points, an RMSE of 7.9 percentage points, and an \(R^2\) of 27.9% on out-of-sample data (data withheld from model construction for validation).

Using the selected variables on a refined dataset in which all 56 selected predictors are present (\(n =\) 459, which may differ from the summary reported in Table 1), we further improved the model by tuning its hyperparameters. The final model was chosen based on the highest \(R^2\) in 10-fold cross-validation; it used \(mtry =\) 20 randomly selected predictors at each split of a tree and separated the data into bins as small as \(nodesize =\) 5 observations.
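The tuning step can be sketched generically. Below is a hypothetical Python stand-in (the actual analysis used R), where `cv_r2` is a deterministic toy surface replacing the real "fit a forest, average the \(R^2\) over 10 validation folds" computation:

```python
from itertools import product

# Toy stand-in for the hyperparameter search: evaluate each (mtry, nodesize)
# pair by its cross-validated R^2 and keep the best pair. Here cv_r2 is a
# made-up response surface peaking at mtry = 20, nodesize = 5 (illustration
# only; in the real analysis each evaluation refits a random forest).
def cv_r2(mtry, nodesize):
    return 0.345 - 0.001 * (mtry - 20) ** 2 - 0.002 * (nodesize - 5) ** 2

grid = list(product([5, 10, 20, 40], [1, 5, 10]))  # candidate pairs
best = max(grid, key=lambda pair: cv_r2(*pair))
print(best, round(cv_r2(*best), 3))  # -> (20, 5) 0.345
```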

The cross-validation \(R^2\) of this tuned model was 34.5%, with an MAE of 5.7 percentage points and an RMSE of 7.7 percentage points. The average predicted yield from this model (9.4%) was close to the average observed yield (9.3%) in the dataset (Figure 6).

The in-sample \(R^2\) of the tuned model was 87.8%, with an MAE of 2.4 percentage points and an RMSE of 3.3 percentage points. That is, the model explained 87.8% of the hatchery yield variability in the sample.
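For reference, these accuracy metrics can be computed as follows; a minimal Python sketch with made-up numbers (the project itself used R; note also that \(R^2\) is sometimes computed as the squared correlation between observed and predicted values, whereas here we use the coefficient-of-determination form):

```python
import math

def accuracy_metrics(observed, predicted):
    """MAE, RMSE, and coefficient of determination R^2 for paired yields (%)."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return mae, rmse, 1 - ss_res / ss_tot

obs = [5.0, 12.0, 9.0, 20.0, 1.0]   # made-up observed yields, %
pred = [6.0, 10.0, 9.5, 18.0, 2.5]  # made-up predicted yields, %
mae, rmse, r2 = accuracy_metrics(obs, pred)
print(round(mae, 2), round(rmse, 2), round(r2, 3))  # -> 1.4 1.52 0.945
```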

Figure 6: Observed and predicted yields by the tuned random forest model. The solid black line denotes 1:1 match or ideal forecast.

4.3 Shapley explanations

4.3.1 Global

We use Shapley values to interpret the model. Figure 7 compares the average magnitudes of the effects of each variable, showing that the week number, winter NDVI, winter salinity, fecundity, and recent ambient salinity and turbidity are among the top determinants of hatchery yields.

Figure 7: Shapley-based feature importance scores for the whole dataset (higher values correspond to more important variables). The horizontal axis shows the average absolute impact of each predictor on the model output (hatchery yield, %). Only the most important predictors are shown.

Figure 8 further reveals global patterns and individual variability of the effects of the top variables. It shows that higher yields are typically observed for broods started before week 30 (end of July), after a winter with high water salinity and low NDVI, with lower fecundity at spawning, ambient salinity above 10, and low ambient turbidity. Note that some of the effects are more neutral than others. For example, starting production during weeks 10–15 does not necessarily lead to a large improvement because the Shapley values for those weeks are around 1, whereas starting late in the year (e.g., week 40) may lead to yields 4–6 percentage points lower than the baseline. Similarly, turbidity above 2 generally corresponds to lower yields, while turbidity below 2 corresponds to yield improvements as high as 3 percentage points. See Figure 13 for partial dependence plots for all the variables.

Figure 8: Shapley-based partial dependence plots for the most important variables. The plots show how a variable’s value (horizontal axis) impacts the predicted yield (vertical axis) of every sample (each dot) in the dataset.

4.3.2 Crashes

We further investigated the model insights for cases of hatchery crashes. We selected 108 well-predicted crashes (cases with yields below 1% and absolute model errors below 3 percentage points). For these cases, Figure 9 shows that week and winter NDVI are still the two most important variables, but the next five most important variables relate to water salinity, such as ambient winter salinity, salinity before filling the tank, and conditioning salinity. Note that the Shapley values are additive, such that their sum matches the model output. This property is demonstrated in Figure 10, which explains how the model produced each of the predicted values \(f(x)\), given that the baseline average predicted yield was 9.4%.
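The additivity property can be checked directly on a toy example. The Python sketch below (hypothetical two-feature background data and a linear stand-in model, not the actual forest) computes exact interventional Shapley values by enumerating coalitions and verifies that the baseline \(E[f(x)]\) plus the feature contributions reproduces the prediction \(f(x)\):

```python
from itertools import combinations
from math import factorial

# Hypothetical background data: (salinity, NDVI) pairs standing in for the
# hatchery dataset, and a simple linear model standing in for the forest.
data = [(10.0, 0.2), (14.0, 0.3), (8.0, 0.25), (12.0, 0.22)]

def model(x):
    return 2.0 * x[0] - 30.0 * x[1]

def value(coalition, x):
    """Expected model output with features in `coalition` fixed at x and the
    remaining features averaged over the background data (interventional)."""
    preds = []
    for background in data:
        z = [x[i] if i in coalition else background[i] for i in range(len(x))]
        preds.append(model(z))
    return sum(preds) / len(preds)

def shapley(x):
    """Exact Shapley values by enumerating all coalitions of other features."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                contribution += weight * (value(set(subset) | {i}, x) - value(set(subset), x))
        phi.append(contribution)
    return phi

x = (12.0, 0.2)
baseline = value(set(), x)  # E[f(x)], the average prediction (the "baseline")
phi = shapley(x)
# Additivity: baseline + sum of contributions equals the prediction f(x)
assert abs(baseline + sum(phi) - model(x)) < 1e-9
print(round(phi[0], 3), round(phi[1], 3))  # -> 2.0 1.275
```

For a linear model, these contributions reduce to the coefficient times the feature's deviation from its background mean, which makes the additivity easy to verify by hand.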

Figure 9: Shapley-based feature importance scores for accurately predicted crashes (higher values correspond to more important variables). Only the most important predictors are shown.

Figure 10: Contributions of predictors to several of the well-predicted crashes. The value \(E[f(x)]\) is the average predicted value for all cases in the dataset (the baseline), while \(f(x)\) is the predicted value for the given case. The rows show the largest impacts of individual predictors (features) for each case and the aggregated impact of the less influential predictors.

4.3.3 Highest yields

Similar to the previous section, we selected cases of well-predicted high yields, with yields above 20% and absolute model errors below 3 percentage points (17 cases). For these cases, the rankings of influential variables (Figure 11) appear considerably perturbed compared with the rankings in Figure 7 and Figure 9. For example, Figure 11 ranks several turbidity indicators as more important, along with fecundity, NDVI, and pH. This might be related to the fact that these high-yield cases already fall within the favorable season, with weeks from 15 to 29 (average 23) and lower winter NDVI from 0.22 to 0.26 (average 0.24); Figure 8 shows that the Shapley effects over these intervals of the predictors are positive but not exceptionally high. Figure 12 gives several example explanations of how the model arrived at its accurate predictions of high yields.

Figure 11: Shapley-based feature importance scores for accurately predicted high yields (higher values correspond to more important variables). Only the most important predictors are shown.

Figure 12: Contributions of predictors to several of the well-predicted high yields. The value \(E[f(x)]\) is the average predicted value for all cases in the dataset (the baseline), while \(f(x)\) is the predicted value for the given case. The rows show the largest impacts of individual predictors (features) for each case and the aggregated impact of the less influential predictors.

5 Conclusion

This analysis demonstrated one possible pathway for using observational data to study the effects of factors on oyster hatchery production. By applying machine learning to the data, we were able to process a large number of predictors, build and evaluate a variety of models, and extract the data patterns that helped the model achieve accurate yield predictions. From these patterns, we can identify conditions that lead to lower yields, which hatcheries should try to avoid (e.g., starting production late in the season or at low salinity), as well as conditions corresponding to improved yields (e.g., low fecundity and turbidity).

Acknowledgements

This research was supported by an award from the Chesapeake Bay Trust and Chesapeake Oyster Alliance.

 

References

Barton A, Hales B, Waldbusser GG, et al (2012) The Pacific oyster, Crassostrea gigas, shows negative correlation to naturally elevated carbon dioxide levels: Implications for near-term ocean acidification effects. Limnology and Oceanography 57:698–710. https://doi.org/10.4319/lo.2012.57.3.0698
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57:289–300. https://www.jstor.org/stable/2346101
Breiman L (2001) Random forests. Machine Learning 45:5–32. https://doi.org/10.1023/A:1010933404324
Ekstrom JA, Suatoni L, Cooley SR, et al (2015) Vulnerability and adaptation of US shellfisheries to ocean acidification. Nature Climate Change 5:207–214. https://doi.org/10.1038/nclimate2508
Estes RM, Friedman CS, Elston RA, Herwig RP (2004) Pathogenicity testing of shellfish hatchery bacterial isolates on Pacific oyster Crassostrea gigas larvae. Diseases of Aquatic Organisms 58:223–230. https://doi.org/10.3354/dao058223
Gray MW, Alexander ST, Beal BF, et al (2022) Hatchery crashes among shellfish research hatcheries along the Atlantic coast of the United States: A case study of production analysis at Horn Point Laboratory. Aquaculture 546:737259. https://doi.org/10.1016/j.aquaculture.2021.737259
Greenwell B (2023) fastshap: Fast approximate Shapley values. https://github.com/bgreenwell/fastshap
Hastie TJ, Tibshirani RJ, Friedman JH (2009) The elements of statistical learning: Data mining, inference, and prediction, 2nd edn. Springer, New York
Hornung R, Boulesteix A-L (2022) Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460. https://doi.org/10.1016/j.csda.2022.107460
Hornung R, Wright MN (2023) diversityForest: Innovative complex split procedures in random forests through candidate split sampling. R package version 0.4.0
Hufkens K (2023) Daymetr: Interface to the daymet web services. R package version 1.7, https://github.com/bluegreen-labs/daymetr
Jones JB (2006) Why won’t they grow? – Inhibitory substances and mollusc hatcheries. Aquaculture International 14:395–403. https://doi.org/10.1007/s10499-005-9040-z
Kassambara A (2023b) Ggpubr: ggplot2 based publication ready plots. R package version 0.6.0, https://rpkgs.datanovia.com/ggpubr/
Kassambara A (2023a) Rstatix: Pipe-friendly framework for basic statistical tests. R package version 0.7.2, https://rpkgs.datanovia.com/rstatix/
Kuhn M (2023) Caret: Classification and regression training. R package version 6.0-94, https://github.com/topepo/caret/
Kursa MB, Rudnicki WR (2010) Feature selection with the Boruta package. Journal of Statistical Software 36:1–13
Kursa MB, Rudnicki WR (2022) Boruta: Wrapper algorithm for all relevant feature selection. R package version 8.0.0, https://gitlab.com/mbq/Boruta/
Lundberg SM, Erion G, Chen H, et al (2020) From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2:56–67. https://doi.org/10.1038/s42256-019-0138-9
Mayer M (2023) Shapviz: SHAP visualizations. R package version 0.9.1, https://github.com/ModelOriented/shapviz
Molnar C (2022) Interpretable machine learning: A guide for making black box models explainable, 2nd edn. https://christophm.github.io/interpretable-ml-book
R Core Team (2022) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Stekhoven DJ (2022) missForest: Nonparametric missing value imputation using random forest. https://www.R-project.org
Štrumbelj E, Kononenko I (2014) Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41:647–665. https://doi.org/10.1007/s10115-013-0679-x
Walker T (2017) Seed supply a challenge for North American oyster producers. Hatchery International. https://www.hatcheryinternational.com/seed-supply-a-challenge-for-north-american-oyster-producers-1241/
Wright MN, Wager S, Probst P (2023) Ranger: A fast implementation of random forests. R package version 0.15.1, https://github.com/imbs-hl/ranger

6 Appendix

6.1 Pairwise comparisons of the cross-validation results

Group averages of MAE:

#>                  All               Boruta            BorutaRec 
#>             6.194901             5.931658             5.964305 
#>    InteractionForest InteractionForestRec 
#>             6.257320             5.945040

Group averages of RMSE:

#>                  All               Boruta            BorutaRec 
#>             8.075456             7.912044             7.918358 
#>    InteractionForest InteractionForestRec 
#>             8.131388             7.877465

Group averages of \(R^2\):

#>                  All               Boruta            BorutaRec 
#>            0.2509084            0.2788968            0.2794493 
#>    InteractionForest InteractionForestRec 
#>            0.2420665            0.2866606

Group averages of number of variables selected:

#>                  All               Boruta            BorutaRec 
#>                144.0                 45.2                 52.1 
#>    InteractionForest InteractionForestRec 
#>                144.0                 56.1

Pairwise comparisons:

#> 
#>  Pairwise comparisons using paired t tests 
#> 
#> data:  MAE and Version 
#> 
#>                      All    Boruta BorutaRec InteractionForest
#> Boruta               0.0012 -      -         -                
#> BorutaRec            0.0012 0.2441 -         -                
#> InteractionForest    0.2083 0.0012 0.0012    -                
#> InteractionForestRec 0.0124 0.8543 0.8543    0.0012           
#> 
#> P value adjustment method: BH
#> 
#>  Pairwise comparisons using paired t tests 
#> 
#> data:  RMSE and Version 
#> 
#>                      All   Boruta BorutaRec InteractionForest
#> Boruta               0.034 -      -         -                
#> BorutaRec            0.016 0.904  -         -                
#> InteractionForest    0.322 0.061  0.029     -                
#> InteractionForestRec 0.034 0.794  0.742     0.016            
#> 
#> P value adjustment method: BH
#> 
#>  Pairwise comparisons using paired t tests 
#> 
#> data:  R2 and Version 
#> 
#>                      All   Boruta BorutaRec InteractionForest
#> Boruta               0.047 -      -         -                
#> BorutaRec            0.016 0.957  -         -                
#> InteractionForest    0.420 0.085  0.041     -                
#> InteractionForestRec 0.047 0.756  0.756     0.016            
#> 
#> P value adjustment method: BH
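For reference, the BH adjustment (Benjamini and Hochberg 1995) applied above can be sketched as follows; a minimal Python implementation with hypothetical raw p-values, mirroring R's p.adjust(p, method = "BH"):

```python
# Benjamini-Hochberg (BH) adjustment of p-values: step-up procedure with a
# cumulative minimum taken from the largest p-value downwards, capped at 1.
def bh_adjust(pvalues):
    n = len(pvalues)
    # indices sorted by p-value, largest first
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # also serves as the cap at 1
    for rank, i in enumerate(order):
        # the ascending rank of this p-value is (n - rank)
        running_min = min(running_min, pvalues[i] * n / (n - rank))
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.02, 0.04, 0.30]  # hypothetical raw p-values
print([round(p, 4) for p in bh_adjust(raw)])  # -> [0.004, 0.04, 0.0533, 0.3]
```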

6.2 Partial plots

Figure 13: Shapley-based partial dependence plots for all variables. The plots show how a variable’s value (horizontal axis) impacts the predicted yield (vertical axis) of every sample (each dot) in the dataset. The variables are listed in decreasing order of their global Shapley importance (most important first).

Citation

BibTeX citation:
@report{lyubchich2023,
  author = {Vyacheslav Lyubchich and Matthew W. Gray and Greg M. Silsbe},
  publisher = {University of Maryland Center for Environmental Science},
  title = {Machine Learning of Factors for Improving Oyster Hatchery
    Production},
  date = {2023-09-07},
  address = {Cambridge, Maryland, USA},
  url = {https://vlyubchich.github.io/hatchery_rf},
  langid = {en}
}
For attribution, please cite this work as:
Vyacheslav Lyubchich, Matthew W. Gray, Greg M. Silsbe (2023) Machine learning of factors for improving oyster hatchery production. University of Maryland Center for Environmental Science, Cambridge, Maryland, USA. https://vlyubchich.github.io/hatchery_rf